This is the notebook for the Module 11 Programming Assignment.
Here are a few tips for using the iPython HTML notebook:
You should rename this notebook to be <your JHED id>_PR11.ipynb Do it right now.
Make certain the entire notebook executes before you submit it. (See Hint #5 above)
Change the following variables:
In [1]:
name = "Shehzad Qureshi"
jhed_id = "squresh6"
if name == "Student Name" or jhed_id == "sname1":
raise Exception( "Change the name and/or JHED ID...preferrably to yours.")
Add whatever additional imports you require here. Stick with the standard libraries and those required by the class. The import gives you access to these functions: http://ipython.org/ipython-doc/stable/api/generated/IPython.core.display.html (Copy this link) Which, among other things, will permit you to display HTML as the result of evaluated code (see HTML() or display_html()).
In [2]:
from IPython.core.display import *
from StringIO import StringIO
import numpy as np
from operator import itemgetter
from copy import deepcopy
For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (no pruning or normalized information gain). Use the provided pseudocode. This time you will need to fetch and parse the data yourself (although go ahead and look at the parse_data.py code from the previous assignments). The data is located at (copy link):
http://archive.ics.uci.edu/ml/datasets/Mushroom
You can download the two files and read them to find out the attributes, attribute values and class labels as well as their locations in the file.
One of the things we did not talk about in the lectures was how to deal with missing values. In C4.5, missing values were handled by treating "?" as an implicit attribute value for every attribute. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. Another approach is to skip instances with missing values. Yet another approach is to infer the missing value conditioned on the class. For example, if the class is "safe" and the color is missing, then we would infer the attribute value that is most often associated with "safe", perhaps "red". Pick an approach and implement it.
You should do 10 fold cross-validation and average your performance metric and show the standard deviation. Is accuracy the best performance metric for this problem? Hint: if we take poisonous == 1, then are false positives and false negatives equally important or do we care more about one than the other?
The implementation is left up to you (procedural or OOP) but you must provide a function that trains a Decision Tree from training data and provide a function that classifies a new instance given a Decision Tree.
Extra kudos if you can print your decision tree as ASCII or using NetworkX.
Don't forget to close with a section on what you learned from this Module
We will use the above dataset to train our ID3 algorithm to output a decision tree. We will then analyze the performance of our decision tree using k-fold cross-evaluation.
To start with, we ned to define our concept of the Decision Tree. We shall break the rules a bit and use a class since they are a bit easier to deal with.
Our Node class shall act as both the root node and a child node. It can store the name of the node (attribute or feature), value (classification label), or hold other nodes as a subtree.
There are two methods, getLeaf(), which returns a leaf node with values if it exists, and getSubtree(), which gets a specified subtree if it exists.
In [3]:
class Node(object):
def __init__(self, name, value=None, subtree=None):
self.name, self.children, self.value, self.subtree = name, [], value, subtree
def addChild(self, child, value=None):
self.children.append(child)
def getLeaf(self, leaf):
for c in self.children:
if c.name == leaf:
return c
return None
def getSubtree(self, subtree):
result = [x for x in self.children if x.value is None and len(x.children) > 0 and x.subtree == subtree]
return result[0] if len(result) > 0 else None
We need to be able to print out the tree for debugging and output purposes. The main function is print_tree_ascii which takes a tree and prints it out. Helper functions print_value parses the domains or classification labels, and tree_walk performs a recursive walk from the root to each node.
In [4]:
def print_value(value):
if type(value) == list:
return ', '.join([print_value(x) for x in value])
if type(value) == tuple:
return ' - '.join([str(x) for x in value])
else: return value
In [5]:
def tree_walk(tree, level=0):
ret = '{0}{1} : '.format(' '*level, str(tree.name))
if tree.subtree is not None: ret += '{0}\n'.format(str(tree.subtree))
if tree.value is not None: ret += '{0}\n'.format(print_value(tree.value))
if len(tree.children) > 0:
for c in tree.children:
ret += tree_walk(c, level+1)
return ret
In [6]:
def print_tree_ascii(tree):
print tree_walk(tree)
In [7]:
a = Node(5, subtree='b')
b = Node(1, subtree='p')
a.addChild(b)
c = Node('p', value=('yes', 155))
b.addChild(c)
d = Node('f', subtree='r')
b.addChild(d)
print_tree_ascii(a)
print b.getLeaf('p')
print a.getSubtree('p')
We're going to be testing as we go along so we will load the data earlier rather than later. As usual, we can use Numpy's super fast loadtxt to put everything into a 2D array.
In [8]:
data = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')
In [9]:
data[0]
Out[9]:
We need to test if our dataset is empty. This could happen, in theory, as we continue to partition our dataset. In practice I have yet to see it return true.
In [10]:
def data_is_empty(data):
return len(data) == 0
In [11]:
print data_is_empty(data) == False
print data_is_empty(np.array([])) == True
This is the heart behind the ID3 algorithm. It calculates the entropy of a certain dataset using the formula below:
$E(S) = -\sum{p_i\log_2{p_i}}$
where $p_i$ is the ratio of each occurence.
In [12]:
def calc_entropy(total, counts):
e = -np.sum([(float(x)/total)*np.log2(float(x)/total) for x in counts if x != 0])
return e
In [13]:
calc_entropy(7, [4, 3])
Out[13]:
This is a shortcut for determining the domain values of an attribute or feature. This eliminuates the need to input them beforehand but causes a problem when a new domain value is introduced.
In [14]:
def get_domain_values(data, col):
d = np.unique(data[:, col])
return d
In [15]:
get_domain_values(data, 0)
Out[15]:
This function gives us the total number of instances that meet a certain criteria. We can search by column and domain value and cross-correlate to another. This is necessary for calculating weighted entropy and information gain.
In [16]:
def get_datacount(data, col, value, y_col=None, y_value=None):
a = (data[:, col] == value)
if y_col is None or y_value is None:
return a.sum()
b = (data[:, y_col] == y_value)
c = np.logical_and(a, b)
return c.sum()
In [17]:
get_datacount(data, 0, 'e')
Out[17]:
We need to determine if the partitioned dataset is homogeneous or not. We could do this by just counting the domain values and seeing if there is only one unique value, but a better way is to calculate the entropy. An entropy value of 0 means our dataset is homogenous.
In [18]:
def data_is_homogeneous(data):
domain = get_domain_values(data, 0)
total = len(data[:,0])
e = calc_entropy(total, [get_datacount(data, 0, dval) for dval in domain])
return e == 0.0
In [19]:
class_p_only = data[np.where(data[:,0] == 'e')]
print data_is_homogeneous(class_p_only) == True
print data_is_homogeneous(data) == False
We will need to partition our data as we progress through the algorithm. This function will allow us to select all data which matches a certain value in a column.
In [20]:
def reduce_dataset_by_value(data, col, value):
dataset = data[np.where(data[:,col] == value)]
return dataset
In [21]:
reduce_dataset_by_value(data, 0, 'e').shape
Out[21]:
We also need to determine the majority classification label for a certain algorithm. This is done by segregating the columns by each domain value and then counting the classification labels for the largest domain value.
In [22]:
def majority_label(data, col):
domain = get_domain_values(data, col)
freqs = [(dval, get_datacount(data, col, dval)) for dval in domain]
attrib_majority = sorted(freqs, key=itemgetter(1), reverse=True)
subset = reduce_dataset_by_value(data, col, attrib_majority[0][0])
y_domain = get_domain_values(subset, 0)
y_freqs = [(dval, get_datacount(data, 0, dval)) for dval in y_domain]
class_majority = sorted(y_freqs, key=itemgetter(1), reverse=True)
return attrib_majority[0][0], class_majority
In [23]:
d = reduce_dataset_by_value(data, 5, 'a')
print data_is_homogeneous(d)
majority_label(d, 5)
Out[23]:
Now we need to calculate our weighted entropy. This is done by calculating the entropy for each domain value then weighing them against the ratio of occurences.
In [24]:
def get_weighted_entropy(data, col, y_col):
y_domain, y_total = get_domain_values(data, y_col), float(len(data[:, y_col]))
a_domain, a_total = get_domain_values(data, col), float(len(data))
a_freqs = [get_datacount(data, col, aval) for aval in a_domain]
entropies = [calc_entropy(get_datacount(data, col, dval),
[get_datacount(data, col, dval, y_col, yval) for yval in y_domain])
for dval in a_domain]
entropy = np.sum([a_freqs[i]/a_total*entropies[i] for i,_ in enumerate(a_freqs)])
return entropy
In [25]:
get_weighted_entropy(data, 1, 0)
Out[25]:
After calculating weighted entropy for each attribute, we can now calculate the information gain. This function will pick the attribute with the highest information gain.
$G(S,A) = E(S) - \sum_{v \subset V_A} {|S_v| \over |S|} \cdot {E(S_v)}$
where $E(S)$ is the entropy of the class column and the latter part of the equation is the entropy of the domain values.
In [26]:
def pick_best_attribute(data, attributes, ycol):
y_domain, y_total = get_domain_values(data, ycol), len(data[:,ycol])
y_entropy = calc_entropy(y_total, [get_datacount(data, ycol, dval) for dval in y_domain])
ig = [(i, y_entropy - get_weighted_entropy(data, i, ycol)) for i in attributes]
best = sorted(ig, key=itemgetter(1), reverse=True)[0]
return best[0]
In [27]:
pick_best_attribute(data, range(1, 2), 0)
Out[27]:
We will attempt to use something we learned from CSP and implement a method to order the domain values that we search through in order to reduce the size of the tree. The domain values are in order of increasing entropy since lower entropies tend to be more homogenous. The effectiveness of this implementation was not measured.
In [28]:
def order_domain_values(data, col):
domain, total = get_domain_values(data, col), float(len(data[:, col]))
freq = [(dval, calc_entropy(total, [get_datacount(data, col, dval)])) for dval in domain]
order = sorted(freq, key=itemgetter(1))
return zip(*order)[0]
In [29]:
order_domain_values(data, 5)
Out[29]:
Our ID3 algorithm follows. It is a simple recursive algorithm that loops through each attribute and its respective domain and creates a tree based on entropy calculations. Since our dataset has some unknown domain values, '?', we will skip training for this when we come across it.
In [30]:
def id3(data, attributes, default_label, ycol=0, subtree=None):
if data_is_empty(data): return default_label
domain_value, classification = majority_label(data, ycol)
if data_is_homogeneous(data): return Node(domain_value, value=classification)
if len(attributes) == 0: return Node(domain_value, value=classification)
best_attribute = pick_best_attribute(data, attributes, ycol)
m = majority_label(data, best_attribute)
node = Node(best_attribute, subtree=subtree if subtree is not None else m[0])
attributes.remove(best_attribute)
for domain_value in order_domain_values(data, best_attribute):
if domain_value == '?': continue
data_subset = reduce_dataset_by_value(data, best_attribute, domain_value)
child = id3(data_subset, deepcopy(attributes), node.value, ycol=best_attribute, subtree=domain_value)
node.addChild(child)
return node
Now we will test our algorithm against the dataset.
In [31]:
tree = id3(data, range(1, 23), 'p')
print_tree_ascii(tree)
Classification poses an interesting problem because we have to traverse the tree in order to find the class label. A lot of thought, trial and error went into designing the Node class and methods to make this as simple as possible.
Our classification algorithm uses a simple backtracking-search through the tree. It tests each node against the domain value and traverse the child nodes until it finds a class label. If it doesn't it backtracks until it does. Inference and other checks were not implemented at this time due to time constraints. If the tree and traversal algorithm works like it should, backtracking should not even be necessary.
In [32]:
def classify_backtrack(tree, instance, level=0, debug=False):
if debug: print '{0}{1}'.format(' '*level, tree.name)
if type(tree.name) == int:
value = instance[int(tree.name) - 1]
leaf = tree.getLeaf(value)
if leaf is not None:
vals = sorted(leaf.value, key=itemgetter(1), reverse=True)
if debug: print '{0}{1} : {2}'.format(' '*(level+1), leaf.name, print_value(vals[0]))
return True, vals[0][0]
subtree = tree.getSubtree(value)
if subtree is not None:
result, classification = classify_backtrack(subtree, instance, level+1)
if result: return True, classification
return False, None
In [33]:
def classify(tree, instance):
result, classification = classify_backtrack(tree, instance)
if result: return classification
else: return None
Now we test our classification method.
In [34]:
i = 4
instance = data[i][1:]
print data[i]
classify(tree, instance)
Out[34]:
In [35]:
def partition(dataset, fold, k):
segments = np.array_split(dataset, k)
test = segments[fold-1]
training = [segments[i] for i in xrange(k) if i != (fold-1)]
return np.concatenate(training), test
In [36]:
partition(data, 1, 10)[0].shape
Out[36]:
In [37]:
def split_data(dataset):
x = dataset[:,1:]
y = dataset[:,0]
return x, y
In [38]:
split_data(data)[1].shape
Out[38]:
The output of our validation will be a tuple of prediction and truth. This is necessary because we need to do more than just averaging of error as we will show below.
In [39]:
def cross_validation(dataset, default_label, folds=10):
output = []
attributes = range(1, 23)
for i in xrange(1, folds+1):
training_set, test_set = partition(dataset, i, folds)
tree = id3(training_set, attributes, default_label)
x, y = split_data(test_set)
y_h = np.array([classify(tree, xval) for xval in x])
output.append((y, y_h))
return output
For all classification problems is is very helpful to show the confusion matrix. This matrix contains the True/False Positives and True/False Negative and gives insight as to how we're doing. With this data we can not only calculate the accuracy, but the precision, recall and error margin as well, which are all very good metrics.
In [40]:
def confusion_matrix(y, y_h, pos, neg):
total = len(y)
tp = np.logical_and((y == pos), (y_h == pos)).sum()
tn = np.logical_and((y == neg), (y_h == neg)).sum()
fp = np.logical_and((y == pos), (y_h == neg)).sum()
fn = np.logical_and((y == neg), (y_h == pos)).sum()
return tp, tn, fp, fn
In [41]:
def print_confusion_matrix(tp, tn, fp, fn, pos, neg):
html = '''
<table>
<tr><td> </td><td>{0}</td><td>{1}</td></tr>
<tr><td>{2}</td><td>{3}</td><td>{4}</td></tr>
<tr><td>{5}</td><td>{6}</td><td>{7}</td></tr>
'''.format(pos, neg, pos, tp, fp, neg, fn, tn)
h = HTML(html)
return h
In [42]:
def print_test_stats(total, tp, tn, fp, fn):
accuracy = (tp + tn)/float(total)
error = (fn + fp)/float(total)
precision = float(tp)/(tp + fp)
recall = float(tp)/(tp + fn)
classified = tp + tn + fp + fn
print "total:", total
print "total classified: {0} ({1:.3%})".format(classified, classified/float(total))
print "total unclassified: {0} ({1:.3%})".format(total - classified, (total - classified)/float(total))
print "accuracy: {0:.3%}".format(accuracy)
print "error: {0:.3%}".format(error)
print "precision: {0:.3%}".format(precision)
print "recall: {0:.3%}".format(recall)
This is our test for 10-fold cross validation. Analysis will follow.
In [43]:
dataset = np.loadtxt('agaricus-lepiota.data', delimiter=',', dtype='S1')
np.random.shuffle(dataset)
%time output = cross_validation(dataset, 'p')
In [44]:
cm = np.array([confusion_matrix(y, y_h, 'p', 'e') for y, y_h in output])
tp, tn, fp, fn = cm.sum(axis=0)
total = 0
for _, y_h in output:
total += len(y_h)
print_test_stats(total, tp, tn, fp, fn)
print_confusion_matrix(tp, tn, fp, fn, 'p', 'e')
Out[44]:
An accuracy of above 99% and precision of above 99% is fantastic! Just for the sake of argument, let's see if flipping the positive and negative values have any effect.
In [45]:
%time output = cross_validation(dataset, 'e')
cm = np.array([confusion_matrix(y, y_h, 'e', 'p') for y, y_h in output])
tp, tn, fp, fn = cm.sum(axis=0)
total = 0
for _, y_h in output:
total += len(y_h)
print_test_stats(total, tp, tn, fp, fn)
print_confusion_matrix(tp, tn, fp, fn, 'e', 'p')
Out[45]:
So the results are more or less the same.
So part of the results above show that some instances are not classified. This seems to be contributing to the error rate as about half a percent seem to miss the mark. Introspection revealed that this had to do with some missing domain values in some of the data ('?' values).
Random forests may be the solution to that. The idea behind that is to train many, many trees with randomized population samples, then run the classifier against all the trees. The final classification will be the maximum value of all labels.
Let's start with a function to determine the maximum value of the output. We need to take care and handle those unclassified labels as necessary.
In [46]:
def max_value(values):
freq = [(v, (values == v).sum()) for v in np.unique(values) if v is not None]
return sorted(freq, key=itemgetter(1), reverse=True)[0][0]
In [47]:
a = np.array(['e', 'p', 'e', 'p', 'e', 'e', None])
max_value(a)
Out[47]:
Shortcut function to classify from a forest. Easier than using list comprehension.
In [48]:
def classify_forest(forest, instance):
values = np.array([classify(tree, instance) for tree in forest])
return max_value(values)
We'll rewrite our cross validation function to use forests. We'll default to 100 trees which should be a good amount to start with.
In [49]:
def rf_cross_validation(dataset, default_label, folds=10, forest_size=100):
output, forest = [], []
attributes = range(1, 23)
for i in xrange(1, folds+1):
training_set, test_set = partition(dataset, i, folds)
for j in xrange(forest_size):
data = training_set[np.random.randint(0, training_set.shape[0], size=training_set.shape[0])]
tree = id3(data, attributes, default_label)
forest.append(tree)
x, y = split_data(test_set)
y_h = np.array([classify_forest(forest, xval) for xval in x])
output.append((y, y_h))
return output
In [50]:
%time forest_output = rf_cross_validation(dataset, 'p')
In [51]:
cm = np.array([confusion_matrix(y, y_h, 'p', 'e') for y, y_h in forest_output])
tp, tn, fp, fn = cm.sum(axis=0)
total = 0
for _, y_h in forest_output:
total += len(y_h)
print_test_stats(total, tp, tn, fp, fn)
print_confusion_matrix(tp, tn, fp, fn, 'p', 'e')
Out[51]:
Seriously amazing.
ID3 is a fairly simple algorithm for generating trees based on input data. It's recursive nature allows it to build the tree from the ground up. Care must be taken to choose the appropriate representation for the tree because generating the tree is only half the battle. A classification method must be implemented as well that walks through the tree and outputs the correct classification label.
There is more to the algorithm than just generating it. As test results showed above, the error rate increases when the data is not so great. One problem with ID3 is that it is prone to overfitting. There are methods used to prevent this, such as pruning and CHAID, neither of which were implemented.
One method that was implemented was Random Forests. This involves generating lots of trees using randomized population samples, then training them against the data. The idea is that no tree will be exactly alike. Classification is done against every tree in the forest and the maximum label value is chosen. As a test showed above, the result is a remarkable improvement.
In [51]: